Australian Web Archive
   HOME

TheInfoList



OR:

The Australian Web Archive (AWA) is an publicly available
online database An online database is a database accessible from a local network or the Internet, as opposed to one that is stored locally on an individual computer or its attached storage (such as a CD). Online databases are hosted on websites, made available as s ...
of archived Australian websites, hosted by the
National Library of Australia The National Library of Australia (NLA), formerly the Commonwealth National Library and Commonwealth Parliament Library, is the largest reference library in Australia, responsible under the terms of the ''National Library Act 1960'' for "mainta ...
(NLA) on its
Trove Trove is an Australian online library database owned by the National Library of Australia in which it holds partnerships with source providers National and State Libraries Australia, an aggregator and service which includes full text document ...
platform, an online library database aggregator. It comprises the NLA's own PANDORA archive, the
Australian Government Web Archive The Australian Web Archive (AWA) is an publicly available online database of archived Australian websites, hosted by the National Library of Australia (NLA) on its Trove platform, an online library database aggregator. It comprises the NLA's own ...
(AGWA) and the
National Library of Australia The National Library of Australia (NLA), formerly the Commonwealth National Library and Commonwealth Parliament Library, is the largest reference library in Australia, responsible under the terms of the ''National Library Act 1960'' for "mainta ...
's ".au"
domain Domain may refer to: Mathematics *Domain of a function, the set of input values for which the (total) function is defined **Domain of definition of a partial function **Natural domain of a partial function **Domain of holomorphy of a function * Do ...
collections. Access is through a single interface in Trove, which is publicly available. The Australian Web Archive was created in March 2019, and is one of the biggest
web archive The Web ARChive (WARC) archive format specifies a method for combining multiple digital resources into an aggregate archive file together with related information. The WARC format is a revision of the Internet Archive's ARC_IA File Format that ...
s in the world. Its purpose is to provide a resource for historians and researchers, now and into the future.


History of the three components

The PANDORA service started archiving websites in October 1996. In 2005, the NLA started archiving annual snapshots of the entire Australian web domain ( URLs with the
suffix In linguistics, a suffix is an affix which is placed after the stem of a word. Common examples are case endings, which indicate the grammatical case of nouns, adjectives, and verb endings, which form the conjugation of verbs. Suffixes can carry ...
. ".au"), collected via large crawl harvests. Later, the earliest websites from the .au web domain, dating back to 1996, were obtained from the
Internet Archive The Internet Archive is an American digital library with the stated mission of "universal access to all knowledge". It provides free public access to collections of digitized materials, including websites, software applications/games, music, ...
. In 2019 this content was first made publicly accessible through Trove. The PANDORA infrastructure, which works well for a selective small scale archiving, does not adapt to large scale "bulk harvesting" of web content, so a new technical system had to be developed whereby a web archiving service which would integrate the delivery of archived websites within a live website interface delivering the archived websites seamlessly to the user, which is difficult to achieve technically.


AGWA

Australian Government The Australian Government, also known as the Commonwealth Government, is the national government of Australia, a federal parliamentary constitutional monarchy. Like other Westminster-style systems of government, the Australian Government i ...
websites are Commonwealth records, and are therefore publications to be managed in accordance with the ''Archives Act 1983''. The Australian Government Web Archive (AGWA) consists of bulk archiving of
Commonwealth Government The Australian Government, also known as the Commonwealth Government, is the national government of Australia, a federal parliamentary constitutional monarchy. Like other Westminster-style systems of government, the Australian Government ...
websites. The NLA began regular harvests of the websites in June 2011, after a significant obstacle had been overcome with an administrative agreement made in May 2010 allowing the NLA to collect, preserve and make accessible government websites without having to seek prior permission for each website or document, as was the case before that. The service uses the
Heritrix Heritrix is a web crawler designed for web archiving. It was written by the Internet Archive. It is available under a free software license and written in Java. The main interface is accessible using a web browser, and there is a command-line too ...
web crawler for harvesting,
WARC file The Web ARChive (WARC) archive format specifies a method for combining multiple digital resources into an aggregate archive file together with related information. The WARC format is a revision of the Internet Archive's ARC_IA File Format that ha ...
s for storage and Open Wayback for delivery of the service. There is a huge amount of publishing by the government, but many challenges to overcome trying to preserve content, such as its sudden disappearance. In March 2014, the AGWA was made publicly accessible. The AGWA meets the preservation and retention requirements for websites as "retain as national archives" (RNA) material under the ''Archives Act''; however
video Video is an electronic medium for the recording, copying, playback, broadcasting, and display of moving visual media. Video was first developed for mechanical television systems, which were quickly replaced by cathode-ray tube (CRT) syste ...
s and document files ( such as
PDF Portable Document Format (PDF), standardized as ISO 32000, is a file format developed by Adobe in 1992 to present documents, including text formatting and images, in a manner independent of application software, hardware, and operating systems. ...
s or
Word document Microsoft Word is a word processor, word processing software developed by Microsoft. It was first released on October 25, 1983, under the name ''Multi-Tool Word'' for Xenix systems. Subsequent versions were later written for several other pla ...
s) are not always captured, so must be managed separately. As of early 2015, the AGWA included content dating from 2005, which amounted to about 144 million files occupying 15
terabyte The byte is a units of information, unit of digital information that most commonly consists of eight bits. Historically, the byte was the number of bits used to encode a single character (computing), character of text in a computer and for this ...
s. It only included Commonwealth Government websites collected through bulk harvests of nearly 1000 seed URLs. The scheduling of the harvests was not yet routinely established, but harvests were being conducted roughly three times per year.


Amalgamation

In 2017, the AGWA and the PANDORA archive were amalgamated with the other web archive collections, to form the Trove web archive collection. After further development and the creation of the Australia Web Archive, government websites archived via AGWA and now included in AWA can still be searched separately using the "Advanced Search" option.


Description of AWA

A web archive is described by the NLA as a "collection of snapshots of websites captured while they are accessible on the web, and then preserved in a static copy". The collection archived in the AWA is "relevant to the cultural, social, political, research and commercial life and activities of Australia and Australians". It collects web material via both scheduled archiving of selected websites and publications as well as some
ad hoc Ad hoc is a Latin phrase meaning literally 'to this'. In English, it typically signifies a solution for a specific purpose, problem, or task rather than a generalized solution adaptable to collateral instances. (Compare with ''a priori''.) Com ...
harvesting relating to significant events. As of March 2019, when it began, AWA already contained around 600
terabyte The byte is a units of information, unit of digital information that most commonly consists of eight bits. Historically, the byte was the number of bits used to encode a single character (computing), character of text in a computer and for this ...
s of data, with 9 billion records. It contains more functionality than the
Wayback Machine The Wayback Machine is a digital archive of the World Wide Web founded by the Internet Archive, a nonprofit based in San Francisco, California. Created in 1996 and launched to the public in 2001, it allows the user to go "back in time" and see ...
, hosted by the
Internet Archive The Internet Archive is an American digital library with the stated mission of "universal access to all knowledge". It provides free public access to collections of digitized materials, including websites, software applications/games, music, ...
, allowing
full-text search In text retrieval, full-text search refers to techniques for searching a single computer-stored document or a collection in a full-text database. Full-text search is distinguished from searches based on metadata or on parts of the original texts ...
ing using a
search engine A search engine is a software system designed to carry out web searches. They search the World Wide Web in a systematic way for particular information specified in a textual web search query. The search results are generally presented in a ...
built in-house. The developers also devised techniques to filter out unwanted "noise". The data remains on the Library servers, although a move to the
cloud In meteorology, a cloud is an aerosol consisting of a visible mass of miniature liquid droplets, frozen crystals, or other particles suspended in the atmosphere of a planetary body or similar space. Water or various other chemicals may co ...
is envisaged in the future, as content grows. Usability by a wide range of users, and in particular the search functionality, were major focuses during development. The archive is fully searchable, based on a combination of techniques used by the developers. Each team created a unique and complex
search algorithm In computer science, a search algorithm is an algorithm designed to solve a search problem. Search algorithms work to retrieve information stored within particular data structure, or calculated in the search space of a problem domain, with eith ...
, by adapting a version of
Google Google LLC () is an American multinational technology company focusing on search engine technology, online advertising, cloud computing, computer software, quantum computing, e-commerce, artificial intelligence, and consumer electronics. ...
’s page ranking algorithm (based frequency of clicks on a page), modified to lead to better, high-quality resources. Other technologies include a Bayesian filter (effectively a
spam filter Email filtering is the processing of email to organize it according to specified criteria. The term can apply to the intervention of human intelligence, but most often refers to the automatic processing of messages at an SMTP server, possibly appl ...
), a
Not Safe For Work Not safe for work (NSFW) is Internet slang or shorthand used to mark links to content, videos, or website pages the viewer may not wish to be seen looking at in a public, formal or controlled environment. The marked content may contain nudity, p ...
classifier from
Yahoo Yahoo! (, styled yahoo''!'' in its logo) is an American web services provider. It is headquartered in Sunnyvale, California and operated by the namesake company Yahoo! Inc. (2017–present), Yahoo Inc., which is 90% owned by investment funds ma ...
, and
machine learning Machine learning (ML) is a field of inquiry devoted to understanding and building methods that 'learn', that is, methods that leverage data to improve performance on some set of tasks. It is seen as a part of artificial intelligence. Machine ...
. There is a "Limit to the gov.au web domain" option before searching, and government websites archived via AGWA can still be searched separately using the "Advanced Search" option. Other options in Advanced Search are to limit by timespan of the snapshots, domain and file type. With many of the earlier websites from the 1990s now lost, mainly because of the frequent change of web platforms, the Australian Web Archive is a significant initiative that will help to save current and future web pages, especially Australian content. Material will continue to be added to the Archive, and other online material collected in accordance with the ''National Library Act 1960'', the
legal deposit Legal deposit is a legal requirement that a person or group submit copies of their publications to a repository, usually a library. The number of copies required varies from country to country. Typically, the national library is the primary reposit ...
provisions of the ''
Copyright Act 1968 The copyright law of Australia defines the legally enforceable rights of creators of creative and artistic works under Australian law. The scope of copyright in Australia is defined in the '' Copyright Act 1968'' (as amended), which applies the ...
'' and the NLA's digital collections selection policy.


Asia/Pacific websites

Websites in the
Asia Pacific region Asia Pacific Region can refer to: * Asia-Pacific * WOSM-Asia-Pacific Region * WAGGGS-Asia Pacific Region * Asia-Pacific Economic Cooperation The Asia-Pacific Economic Cooperation (APEC ) is an inter-governmental forum for 21 member economy, ...
are not included in the AWA, but NLA partners with the
Internet Archive The Internet Archive is an American digital library with the stated mission of "universal access to all knowledge". It provides free public access to collections of digitized materials, including websites, software applications/games, music, ...
to collect and preserve "selected Asia/Pacific websites related to specific events or socio-political groups".


See also

*
National edeposit National edeposit (NED) is a collaboration between Australia's nine national, state and territory libraries which provides for the legal deposit, management, storage and preservation of, and access to, published electronic material across Aust ...


References


External links

* {{Authority control Archives in Australia Web archiving initiatives Australian digital libraries Online databases